Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support hashing vectorizer inside a union #176

Merged
merged 13 commits into from
May 2, 2017
Merged

Conversation

lopuhin
Copy link
Contributor

@lopuhin lopuhin commented Apr 28, 2017

Also add an eli5.sklearn.invert_and_fit helper that inverts and fits a vectorizer (even if it's hiding in a union). The fix in 85c8a61 is not strictly related to this PR, I think it's an old issue revealed by new tests.

My primary motivation was deep-deep model and predictions explanation support.

Fixes #16

@@ -219,6 +219,17 @@ automatically; to handle HashingVectorizer_ and FeatureHasher_ for
# and ``ivec`` can be used as a vectorizer for eli5.explain_weights:
eli5.explain_weights(clf, vec=ivec)

HashingVectorizer_ is also supported inside a FeatureUnion_:
:func:`eli5.explain_prediction` handles this case automatically, and for
:func:`eli5.explain_weights` you can use :func:`eli5.sklearn.invert_and_fit``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an extra ` at the end of the line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, thanks for spotting it, it's so tiny! Fixed in 94d81e9

lopuhin added 2 commits April 28, 2017 23:51
Without this fix sklearn gives an error in
sklearn/feature_extraction/hashing.py, line 142:
TypeError: Expected bytes, got numpy.string_
@codecov-io
Copy link

codecov-io commented Apr 28, 2017

Codecov Report

Merging #176 into master will increase coverage by 0.09%.
The diff coverage is 98.3%.

@@            Coverage Diff             @@
##           master     #176      +/-   ##
==========================================
+ Coverage   97.25%   97.34%   +0.09%     
==========================================
  Files          39       39              
  Lines        2405     2450      +45     
  Branches      452      464      +12     
==========================================
+ Hits         2339     2385      +46     
  Misses         34       34              
+ Partials       32       31       -1
Impacted Files Coverage Δ
eli5/sklearn/__init__.py 100% <100%> (ø) ⬆️
eli5/sklearn/utils.py 87.38% <100%> (-0.34%) ⬇️
eli5/sklearn/unhashing.py 96.95% <98.14%> (+2.12%) ⬆️

@lopuhin lopuhin changed the title [WIP] Support hashing vectorizer inside a union Support hashing vectorizer inside a union Apr 30, 2017
@lopuhin
Copy link
Contributor Author

lopuhin commented Apr 30, 2017

@kmike this is ready for review, could you please check it again? I updated the PR description and added a py2 fix.

This feature could be less complicated if InvertableHashingVectorizer would return feature names as strings, not as lists of {'name': , 'sign': } dicts - in this case we would not need code that is currently in _invhashing_union_feature_names_scale. But in this case we would loose the ability to format inverted features nicely, and maybe also work with their signs, etc.

@lopuhin
Copy link
Contributor Author

lopuhin commented Apr 30, 2017

The main concert I have here is the name of the invert_and_fit function - it makes sense as a function in eli5.sklearn.unhashing, but maybe for eli5.sklearn it needs a more descriptive name? We'll be making it public, so it would be harder to change the name.

""" Create an InvertableHashingVectorizer from hashing vectorizer vec
and fit it on docs. If vec is a FeatureUnion, do it for all
hashing vectorizers in the union.
Returns an InvertableHashingVectorizer, or a Union, or an unchanged vectorizer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union -> FeatureUnion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, fixed in b89d0f3. Thanks!

@kmike
Copy link
Contributor

kmike commented May 2, 2017

Yeah, I think it should be named either invert_hashing_and_fit, or imported & documented consistently as sklearn.hashing.invert_and_fit. The former is probably easier.

Thanks @kmike for suggestion!
Also fix docstring: Union -> FeatureUnion
@lopuhin
Copy link
Contributor Author

lopuhin commented May 2, 2017

Yes, invert_hashing_and_fit sounds much better than invert_and_fit, thanks!
Changed the name in b89d0f3

@kmike
Copy link
Contributor

kmike commented May 2, 2017

Looks good, thanks @lopuhin!

@kmike kmike merged commit 73f0ac2 into master May 2, 2017
@lopuhin
Copy link
Contributor Author

lopuhin commented May 2, 2017

Thanks for review @kmike !

@lopuhin lopuhin deleted the union-hashing-vec-3 branch May 2, 2017 13:07
@kmike kmike added this to the 0.6 milestone May 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support FeatureUnion for InvertableHashingVectorizer
3 participants